Purpose¶

This notebook gives an overview of the data in the dataset data/immo_data202208_v2.parquet.

Summary¶

The dataset contains 22481 rows (9126 more than the first version) and 134 columns (26 more).
It holds data on the following features:

| Feature | Columns |
| ----------------- | ------- |
| Availability | Availability (=), Availability_merged (=), Disponibilità (=), Disponibilité (=), Verfügbarkeit (=), detail_responsive#available_from (=) |
| Address | Commune (=), Comune (=), Gemeinde (=), Municipality (+9126), Municipality_merged (=), detail_responvice#municipality (=), address (=), address_s (new), Locality (+9126), location (+9126), location_parsed (+9126), Zip (+9126), plz (new), plz_parsed (new) |
| Coordinates | Latitude (+9126), lat (+9126), Longitude (+9126), lon (+9126) |
| Floor | Floor (+4636), Floor_merged (=), Floor_unified (new), Piano (=), Stockwerk (=), Étage (=), detail_responsive#floor (=) |
| Floor space | Floor space (=), Floor space: (new), Floor_space_merged (=), Minimum floor space (new), Nutzfläche (=), Superficie utile (=), Surface utile (=), detail_responsive#surface_usable (=) |
| Gross return | Gross return (=), Gross yield (new) |
| Plot area | Grundstücksfläche (=), Land area: (new), Plot area (=), Plot_area_merged (=), Plot_area_unified (new), Superficie del terreno (=), Surface de terrain (=), detail_responsive#surface_property (=) |
| Living space | Living space (=), Living_area_unified (new), Living_space_merged (=), Superficie abitabile (=), Surface habitable (=), Surface living: (new), Wohnfläche (=), detail_responsive#surface_living (=) |
| Environment | NoisePollutionRailwayL (+9126), NoisePollutionRailwayM (+9126), NoisePollutionRailwayS (+9126), NoisePollutionRoadL (+9126), NoisePollutionRoadM (+9126), NoisePollutionRoadS (+9126), PopulationDensityL (+9126), PopulationDensityM (+9126), PopulationDensityS (+9126), RiversAndLakesL (+9126), RiversAndLakesM (+9126), RiversAndLakesS (+9126), ForestDensityL (+9126), ForestDensityM (+9126), ForestDensityS (+9126), WorkplaceDensityL (+9126), WorkplaceDensityM (+9126), WorkplaceDensityS (+9126), distanceToTrainStation (+9126) |
| gde | gde_area_agriculture_percentage (+9126), gde_area_forest_percentage (+9126), gde_area_nonproductive_percentage (+9126), gde_area_settlement_percentage (+9126), gde_average_house_hold (+9126), gde_empty_apartments (+9126), gde_foreigners_percentage (+9126), gde_new_homes_per_1000 (+9126), gde_politics_bdp (+5256), gde_politics_cvp (+9108), gde_politics_evp (+8653), gde_politics_fdp (+9121), gde_politics_glp (+5413), gde_politics_gps (+9098), gde_politics_pda (+4578), gde_politics_rights (+5117), gde_politics_sp (+9123), gde_politics_svp (+9124), gde_pop_per_km2 (+9126), gde_population (+9126), gde_private_apartments (+9126), gde_social_help_quota (+9126), gde_tax (+9126), gde_workers_sector1 (+9126), gde_workers_sector2 (+9126), gde_workers_sector3 (+9126), gde_workers_total (+9126) |
| Price | price (+9126), price_cleaned (+9126), price_s (new) |
| Rooms | No. of rooms: (new), rooms (+8868) |
| Type | type (+9126), type_unified (new) |
| Homegate features | features (new), Volume: (new), Room height: (new), Number of toilets: (new), Number of floors: (new), Number of apartments: (new), Last refurbishment: (new), Year built: (new) |

Many features are spread across multiple columns. Two separate notebooks explore how they can be aggregated.
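
To illustrate what aggregating such columns can look like, here is a minimal sketch that takes the first non-null value per row across several columns holding the same feature. The helper name `coalesce`, the toy frame and its values are assumptions made for illustration; the aggregation notebooks may use a different approach.

```python
import pandas as pd

# Toy frame mimicking the multilingual floor columns; the values are made up.
toy = pd.DataFrame(
    {
        "Floor": ["2", None, None],
        "Stockwerk": [None, "EG", None],
        "Étage": [None, None, None],
    }
)


def coalesce(frame, columns):
    """Return the first non-null value per row across the given columns."""
    # Back-filling along the column axis moves the first non-null value
    # of each row into the left-most column, which is then selected.
    return frame[columns].bfill(axis=1).iloc[:, 0]


floor_combined = coalesce(toy, ["Floor", "Stockwerk", "Étage"])
print(floor_combined.tolist())
```

The column order passed to `coalesce` decides which value wins when several columns are filled for the same row, so it should reflect which source is trusted most.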

In [ ]:
# Import modules
import pandas as pd
import numpy as np
import sweetviz as sv
In [ ]:
df = pd.read_parquet(
    "https://github.com/Immobilienrechner-Challenge/data/blob/main/immo_data_202208_v2.parquet?raw=true"
)
In [ ]:
df.shape
Out[ ]:
(22481, 134)
In [ ]:
df_v1 = pd.read_csv(
    "https://raw.githubusercontent.com/Immobilienrechner-Challenge/data/main/immoscout_cleaned_lat_lon_fixed_v9.csv",
    low_memory=False,
)
df_v1 = df_v1.drop_duplicates(subset="link")
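
The row and column differences quoted in the summary can be derived from the shapes and column sets of the two frames. The sketch below uses small toy frames as stand-ins for `df` and `df_v1`; the helper name `shape_delta` and all values are assumptions for illustration.

```python
import pandas as pd


def shape_delta(new, old):
    """Return (extra rows, extra columns, column names only present in `new`)."""
    extra_rows = new.shape[0] - old.shape[0]
    extra_cols = new.shape[1] - old.shape[1]
    only_in_new = sorted(set(new.columns) - set(old.columns))
    return extra_rows, extra_cols, only_in_new


# Toy stand-ins for df (V2) and df_v1 (V1); the values are made up.
toy_v1 = pd.DataFrame({"link": ["a", "b"], "price": [1, 2]})
toy_v2 = pd.DataFrame(
    {"link": ["a", "b", "c"], "price": [1, 2, 3], "plz": [8000, 3000, 1200]}
)

extra_rows, extra_cols, only_in_new = shape_delta(toy_v2, toy_v1)
print(extra_rows, extra_cols, only_in_new)
```

Note that `extra_cols` is a net difference: if the newer frame also dropped columns, the number of new columns would be larger than this figure, which is why the set difference is returned as well.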
In [ ]:
# Compare the V2 and V1 datasets and show the sweetviz report
sweet_report = sv.compare([df, "V2 data"], [df_v1, "V1 data"])
sweet_report.show_notebook()

Combining the comparison above with a separate analysis of the columns description, description_detailed, detailed_description, table, details_structured and details, we've identified new columns for existing features, entirely new features, and changes in row counts relative to the first version of the dataset. Each column in the lists below is annotated as follows:

  • (=) if the row count is the same as before
  • (+<int>) if the row count increased by <int>
  • (new) if the column is new
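
As a rough sketch of how such annotations could be produced, the function below compares per-column counts between two frames. It assumes that "row count" means the number of non-null values in a column, which this notebook does not state explicitly; the function name `column_markers` and the toy data are hypothetical.

```python
import pandas as pd


def column_markers(new, old):
    """Label each column of `new` relative to `old` by its non-null count."""
    markers = {}
    for col in new.columns:
        if col not in old.columns:
            markers[col] = "(new)"
            continue
        # Series.count() returns the number of non-null values.
        diff = int(new[col].count()) - int(old[col].count())
        markers[col] = "(=)" if diff == 0 else f"({diff:+d})"
    return markers


# Toy frames with made-up values standing in for the V1 and V2 data.
old = pd.DataFrame({"price": [1.0, 2.0], "Piano": [None, "1"]})
new = pd.DataFrame(
    {
        "price": [1.0, 2.0, 3.0],
        "Piano": [None, "1", None],
        "plz": [8000, 3000, 1200],
    }
)

markers = column_markers(new, old)
print(markers)
```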

Availability¶

No new columns contain availability information:

Columns¶

  • Availability (=)
  • Availability_merged (=)
  • Disponibilità (=)
  • Disponibilité (=)
  • Verfügbarkeit (=)
  • detail_responsive#available_from (=)

Address¶

Municipality¶

The new dataset has one new column with municipality information (address_s), and several existing columns gained observations:

Columns¶

  • Commune (=)
  • Comune (=)
  • Gemeinde (=)
  • Municipality (+9126)
  • Municipality_merged (=)
  • detail_responvice#municipality (=)
  • address (=)
  • address_s (new)
  • Locality (+9126)
  • location (+9126)
  • location_parsed (+9126)

Zip Code¶

Columns¶

  • Zip (+9126)
  • address
  • location
  • location_parsed
  • plz (new)
  • plz_parsed (new)

Street¶

Columns¶

  • address
  • location
  • location_parsed

Coordinates¶

Columns¶

  • Latitude (+9126)
  • Longitude (+9126)
  • lat (+9126)
  • lon (+9126)
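
Latitude/lat and Longitude/lon look like duplicated pairs, although this notebook does not verify that. A minimal check, assuming both columns of a pair are numeric, could look like the sketch below; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the coordinate columns; the values are made up.
toy = pd.DataFrame(
    {
        "Latitude": [47.37, 46.95, np.nan],
        "lat": [47.37, 46.95, np.nan],
    }
)

# Compare with a floating point tolerance and treat NaN == NaN as a match.
same = np.isclose(toy["Latitude"], toy["lat"], equal_nan=True)
columns_match = bool(same.all())
print(columns_match)
```

If the check holds on the real data, one column of each pair could be dropped without losing information.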

Floor¶

Columns¶

  • Floor (+4636)
  • Floor_merged (=)
  • Floor_unified (new)
  • Piano (=)
  • Stockwerk (=)
  • Étage (=)
  • detail_responsive#floor (=)

Floor space¶

Columns¶

  • Floor space (=)
  • Floor space: (new)
  • Floor_space_merged (=)
  • Minimum floor space (new)
  • Nutzfläche (=)
  • Superficie utile (=)
  • Surface utile (=)
  • detail_responsive#surface_usable (=)

Gross return¶

Columns¶

  • Gross return (=)
  • Gross yield (new)

Plot area¶

Columns¶

  • Grundstücksfläche (=)
  • Land area: (new)
  • Plot area (=)
  • Plot_area_merged (=)
  • Plot_area_unified (new)
  • Superficie del terreno (=)
  • Surface de terrain (=)
  • detail_responsive#surface_property (=)

Living space¶

Columns¶

  • Living space (=)
  • Living_area_unified (new, 1502 missing)
  • Living_space_merged (=)
  • Superficie abitabile (=)
  • Surface habitable (=)
  • Surface living: (new)
  • Wohnfläche (=)
  • detail_responsive#surface_living (=)

Space¶

Columns¶

  • Space extracted (+8693)
  • space (new)
  • space_cleaned (new)

Environment¶

Noise pollution railway¶

Columns¶

  • NoisePollutionRailwayL (+9126)
  • NoisePollutionRailwayM (+9126)
  • NoisePollutionRailwayS (+9126)

Noise pollution road¶

Columns¶

  • NoisePollutionRoadL (+9126)
  • NoisePollutionRoadM (+9126)
  • NoisePollutionRoadS (+9126)

Population density¶

Columns¶

  • PopulationDensityL (+9126)
  • PopulationDensityM (+9126)
  • PopulationDensityS (+9126)

Rivers and lakes¶

Columns¶

  • RiversAndLakesL (+9126)
  • RiversAndLakesM (+9126)
  • RiversAndLakesS (+9126)

Forest density¶

Columns¶

  • ForestDensityL (+9126)
  • ForestDensityM (+9126)
  • ForestDensityS (+9126)

Workplace density¶

Columns¶

  • WorkplaceDensityL (+9126)
  • WorkplaceDensityM (+9126)
  • WorkplaceDensityS (+9126)

Distance to train station¶

Columns¶

  • distanceToTrainStation (+9126)

gde_ columns¶

  • gde_area_agriculture_percentage (+9126)
  • gde_area_forest_percentage (+9126)
  • gde_area_nonproductive_percentage (+9126)
  • gde_area_settlement_percentage (+9126)
  • gde_average_house_hold (+9126)
  • gde_empty_apartments (+9126)
  • gde_foreigners_percentage (+9126)
  • gde_new_homes_per_1000 (+9126)
  • gde_politics_bdp (+5256)
  • gde_politics_cvp (+9108)
  • gde_politics_evp (+8653)
  • gde_politics_fdp (+9121)
  • gde_politics_glp (+5413)
  • gde_politics_gps (+9098)
  • gde_politics_pda (+4578)
  • gde_politics_rights (+5117)
  • gde_politics_sp (+9123)
  • gde_politics_svp (+9124)
  • gde_pop_per_km2 (+9126)
  • gde_population (+9126)
  • gde_private_apartments (+9126)
  • gde_social_help_quota (+9126)
  • gde_tax (+9126)
  • gde_workers_sector1 (+9126)
  • gde_workers_sector2 (+9126)
  • gde_workers_sector3 (+9126)
  • gde_workers_total (+9126)
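
Several gde_politics_ columns gained fewer than 9126 observations, which suggests that some of the new rows have missing values there. The sketch below shows one way to select all gde_ columns by prefix and count missing values per column; the toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up values; only the column name pattern matters here.
toy = pd.DataFrame(
    {
        "gde_population": [1000, 2500, 400],
        "gde_politics_bdp": [3.1, None, None],
        "price": [500000, 750000, 620000],
    }
)

# Keep only the columns whose name starts with "gde_".
gde = toy.filter(regex=r"^gde_")

# Count missing values per column, most affected columns first.
missing_per_column = gde.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```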

Price¶

Columns¶

  • price (+9126)
  • price_cleaned (+9126)
  • price_s (new)

Rooms¶

Columns¶

  • No. of rooms: (new)
  • rooms (+8868)

Type¶

Columns¶

  • type (+9126)
  • type_unified (new)

Misc¶

Columns¶

  • Unnamed: 0
  • Unnamed: 0.1
  • df_index
  • provider (new)
  • title (=)
  • url (+9126)
  • link (=)

New features¶

There are several columns containing information that was not present in the old dataset.

Columns¶

  • features (new)
  • Volume: (new)
  • Room height: (new)
  • Number of toilets: (new)
  • Number of floors: (new)
  • Number of apartments: (new)
  • Last refurbishment: (new)
  • Year built: (new)